Abstract— We propose an optimization method for mutual learning that converges to the
same state as optimum ensemble learning within the framework of on-line learning, and
we analyze its asymptotic property using the statistical mechanics method. The proposed
model consists of two learning steps: two students first learn independently from a
teacher, and then the students learn from each other through mutual learning. In mutual
learning, the generalization error improves even though the teacher takes no part in it.
However, when the initial overlaps (direction cosines) between the teacher and the
students differ, the student with the larger initial overlap tends to end up with a
larger generalization error than it had before the mutual learning. To overcome this
problem, the proposed method optimizes the step sizes of the two students so as to
minimize the asymptotic generalization error. Consequently, the optimized mutual
learning converges to a generalization error identical to that of the optimal ensemble
learning. In addition, we show the relationship between the optimum step size of the
mutual learning and the integration mechanism of the ensemble learning.

Keywords— mutual learning, learning step size, on-line learning, linear perceptron,
statistical mechanics



1 Introduction 

As a model involving the interaction between students, Kinzel proposed mutual
learning within the framework of on-line learning [9][10][11]. Kinzel's model
employs two students, and each student learns with the other student acting as a
teacher. The target of his model is to obtain the same networks through the
learning. On the other hand, ensemble learning algorithms, such as bagging [1]
and Ada-boost [5], try to improve upon the performance of a weak learning machine
by using many weak learning machines; such learning algorithms have recently
received considerable attention. We have noted, however, that the mechanism of
integrating the outputs of many weak learners in ensemble learning is similar to
that of obtaining the same networks through mutual learning.

From the point of view of the learning problem, how the student approaches
the teacher is important. However, Kinzel [9][10][11] does not deal with the
teacher-student relation, since a teacher is not employed in his model. In contrast
to Kinzel's model, we have proposed mutual learning between two students who
learn from a teacher in advance [12]. In our previous work [12], we showed that
the generalization error of the students becomes smaller through the mutual
learning even if the teacher does not take part in the mutual learning. We also
showed that, in the limit where the learning step size goes to zero, the student
with the larger initial overlap (direction cosine) transiently passes through the
state of the optimum ensemble learning during mutual learning.

In this paper, we propose a new mutual learning algorithm that uses a different
learning step size for each student. We analyze the asymptotic property
of the proposed learning algorithm through the statistical mechanics method,
and propose an optimization method for the learning step size. By using the
optimum learning step size, we can obtain the optimum asymptotic property of
the generalization error through mutual learning. The proposed method is an
extension of our previous work [12].

In this paper, we assume that the teacher and each student are linear perceptrons.
An on-line learning [3] scheme is employed. In the proposed method, two
students individually learn from a teacher during initial learning, and then they
learn from each other during mutual learning. Therefore, we assume the overlaps
between teacher and students are not zero at the initial state of mutual
learning. In the mutual learning, each student learns from the other as the
teacher. Since the teacher is not used in the mutual learning, we refer to it as
a latent teacher in this paper.

In Section 2, we formulate latent teacher, student, and mutual learning al- 
gorithms. In Section 3, we derive differential equations of the order parameters 
that depict the dynamics of mutual learning. We employ different learning step 
sizes for each student. We then derive the generalization error by using the 
order parameters. In Section 4, we solve the differential equations with different 
learning step sizes, and then analyze the effect of the learning step size on the 
asymptotic property of the mutual learning. After that, we obtain the optimum 
ratio of the students' learning step sizes which realizes the minimum generaliza- 
tion error. Moreover, we discuss the relation between the learning step size of 
mutual learning and the integration mechanism of ensemble learning. 

Figure 1: Network structure of latent teacher and student networks, all having
the same network structure.



2 Formulation of mutual learning with a latent 
teacher 

In this section, we formulate the latent teacher and student networks, and the
mutual learning algorithms. We assume the latent teacher and student networks
receive an $N$-dimensional input $\boldsymbol{x}(m) = (x_1(m), \ldots, x_N(m))$ at the
$m$-th learning iteration, as shown in Fig. 1. The learning iteration $m$ is omitted
in the figure. The latent teacher network is a linear perceptron, and the student
networks are two linear perceptrons. We also assume that the elements $x_i(m)$ of the
independently drawn input $\boldsymbol{x}(m)$ are uncorrelated random variables with zero
mean and variance $1/N$; that is, the elements are drawn from a probability
distribution $P(\boldsymbol{x})$. In this paper, the thermodynamic limit of $N \to \infty$
is assumed. The norm of the input vector $|\boldsymbol{x}|$ then becomes one.

$$\langle x_i \rangle = 0, \quad \langle (x_i)^2 \rangle = \frac{1}{N}, \quad |\boldsymbol{x}| = 1, \tag{1}$$

where $\langle \cdots \rangle$ denotes an average, and $|\cdot|$ denotes the norm of a vector.

The latent teacher network is a linear perceptron, and is not subject to training.
Thus, its weight vector is fixed in the learning process. The output of the
latent teacher $v(m)$ for the $N$-dimensional input
$\boldsymbol{x}(m) = (x_1(m), x_2(m), \ldots, x_N(m))$ at the $m$-th learning iteration is

$$v(m) = \sum_{i=1}^{N} B_i x_i(m) = \boldsymbol{B} \cdot \boldsymbol{x}(m), \tag{2}$$

$$\boldsymbol{B} = (B_1, B_2, \ldots, B_N), \tag{3}$$

where the latent teacher weight vector $\boldsymbol{B}$ is an $N$-dimensional vector like the
input vector, and each element $B_i$ of the latent teacher weight vector $\boldsymbol{B}$ is
drawn from a probability distribution of zero mean and unit variance. Assuming the
thermodynamic limit of $N \to \infty$, the norm of the latent teacher weight vector
$|\boldsymbol{B}|$ becomes $\sqrt{N}$.

$$\langle B_i \rangle = 0, \quad \langle (B_i)^2 \rangle = 1, \quad |\boldsymbol{B}| = \sqrt{N}. \tag{4}$$

The output distribution for the latent teacher $P(v)$ follows a Gaussian distribution
of zero mean and unit variance in the thermodynamic limit of $N \to \infty$.
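
As a quick numerical sanity check of Eqs. (1)-(4), the following minimal sketch samples a latent teacher and a batch of inputs and measures the two norms and the variance of the teacher output. Gaussian sampling, the sizes, and the function name `sample_teacher_and_inputs` are illustrative assumptions; the text only fixes the means and variances.

```python
import numpy as np

def sample_teacher_and_inputs(n_dim, n_samples, seed=0):
    """Sample a latent teacher B and inputs x with the moments of Eqs. (1) and (4).

    Gaussian sampling is an assumption; the text only specifies mean and variance.
    """
    rng = np.random.default_rng(seed)
    B = rng.normal(0.0, 1.0, size=n_dim)                 # zero mean, unit variance
    X = rng.normal(0.0, np.sqrt(1.0 / n_dim),            # zero mean, variance 1/N
                   size=(n_samples, n_dim))
    return B, X

B, X = sample_teacher_and_inputs(n_dim=2000, n_samples=2000)
v = X @ B                                                # teacher outputs, Eq. (2)

mean_input_norm = float(np.linalg.norm(X, axis=1).mean())        # close to 1
teacher_norm_ratio = float(np.linalg.norm(B) / np.sqrt(B.size))  # close to 1
output_variance = float(v.var())                                 # close to 1
print(mean_input_norm, teacher_norm_ratio, output_variance)
```

For finite $N$ the three quantities only approximate one; the deviations shrink as $N$ grows, which is what the thermodynamic limit formalizes.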

Two linear perceptrons are used as the student networks that compose the
mutual learning machine. Each student network has the same architecture as
the latent teacher network. Each element of $\boldsymbol{J}^k(0)$, the initial value of
the $k$-th student weight vector $\boldsymbol{J}^k$, is drawn from a probability distribution
of zero mean and unit variance. The norm of the initial student vector
$|\boldsymbol{J}^k(0)|$ is $\sqrt{N}$ in the thermodynamic limit of $N \to \infty$,

$$\langle J_i^k(0) \rangle = 0, \quad \langle (J_i^k(0))^2 \rangle = 1, \quad |\boldsymbol{J}^k(0)| = \sqrt{N}. \tag{5}$$

The $k$-th student output $u_k(m)$ for the $N$-dimensional input $\boldsymbol{x}(m)$ is

$$u_k(m) = \sum_{i=1}^{N} J_i^k(m) x_i(m) = \boldsymbol{J}^k(m) \cdot \boldsymbol{x}(m), \tag{6}$$

$$\boldsymbol{J}^k(m) = (J_1^k, J_2^k, \ldots, J_N^k). \tag{7}$$

Generally, the norm of the student weight vector $|\boldsymbol{J}^k(m)|$ changes as the time
step proceeds. Therefore, the ratio $l_k$ of the norm to $\sqrt{N}$ is considered and is
called the length of the student weight vector $\boldsymbol{J}^k$. The norm at the $m$-th
iteration is $l_k(m)\sqrt{N}$, and the size of $l_k(m)$ is $O(1)$.

$$|\boldsymbol{J}^k(m)| = l_k(m)\sqrt{N}. \tag{8}$$

The distribution of the output of the $k$-th student $P(u_k)$ follows a Gaussian
distribution of zero mean and variance $l_k^2$ in the thermodynamic limit of $N \to \infty$.

Next, we formulate the learning algorithm. After the students learn from
a latent teacher, mutual learning is carried out. The learning equation of the
mutual learning is

$$\boldsymbol{J}^k(m+1) = \boldsymbol{J}^k(m) + \eta_k \left( u_{k'}(m) - u_k(m) \right) \boldsymbol{x}(m), \tag{9}$$

where $k$ is 1 or 2, $k \neq k'$, $\eta_k$ is the learning step size of the $k$-th student,
and $m$ denotes the iteration number. Equation (9) shows that mutual learning is
carried out between two students. Therefore, the teacher used in the initial
learning is called a latent teacher. We use the gradient descent algorithm in this
paper, while another algorithm was used in Kinzel's work [5]. When the interaction
between students is introduced, the performance of students may be improved if they
exchange knowledge that each student has acquired from the latent teacher in the
initial learning. In other words, two students approach each other through mutual
learning, and tend to move towards the middle of the initial weight vectors. This
tendency is similar to the integration mechanism of ensemble learning, so mutual
learning may mimic this mechanism.
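
The update of Eq. (9) can be sketched in a few lines. The following is a minimal illustration, not the authors' simulation code: the function names, the Gaussian inputs, the dimension, and the run length are assumptions chosen only to show that the two students approach each other.

```python
import numpy as np

def mutual_learning_step(J1, J2, x, eta1, eta2):
    """One on-line update of Eq. (9): each student moves toward the other's output."""
    u1, u2 = J1 @ x, J2 @ x                  # student outputs, Eq. (6)
    J1_new = J1 + eta1 * (u2 - u1) * x
    J2_new = J2 + eta2 * (u1 - u2) * x
    return J1_new, J2_new

def run_mutual_learning(n_dim=500, t_end=20.0, eta1=0.1, eta2=0.1, seed=0):
    """Run m = N * t_end updates; return the initial and final distance |J1 - J2|."""
    rng = np.random.default_rng(seed)
    J1 = rng.normal(size=n_dim)              # stand-ins for the students after
    J2 = rng.normal(size=n_dim)              # initial learning (assumption)
    initial_gap = float(np.linalg.norm(J1 - J2))
    for _ in range(int(n_dim * t_end)):
        x = rng.normal(0.0, np.sqrt(1.0 / n_dim), size=n_dim)  # input, variance 1/N
        J1, J2 = mutual_learning_step(J1, J2, x, eta1, eta2)
    return initial_gap, float(np.linalg.norm(J1 - J2))

initial_gap, final_gap = run_mutual_learning()
print(initial_gap, final_gap)
```

Both students are updated from outputs computed before either weight vector changes, which matches the simultaneous form of Eq. (9).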

3 Theory



In this section, we first derive the differential equations of two order parameters 
which depict the behavior of mutual learning. After that, we derive an auxiliary 
order parameter which depicts the relationship between the latent teacher and 
students. We then rewrite the generalization error using these order parameters. 

We first derive the differential equation of the length of the student weight
vector $l_k$, which is the first order parameter of the system. We rewrite the length
of the student weight vector in Eq. (8) as $\boldsymbol{J}^k \cdot \boldsymbol{J}^k = N l_k^2$. To obtain
a time-dependent differential equation of $l_k$, we square both sides of Eq. (9). We
then average the terms of the equation using the distribution $P(u_k, u_{k'})$. Note
that $\boldsymbol{x}$ and $\boldsymbol{J}^k$ are random variables, so the equation becomes a random
recurrence formula. The squared norm of the weight vectors is $O(N)$ and the norm of
the input $\boldsymbol{x}$ is $O(1)$, so the length of the student weight vector has a
self-averaging property. Here, we rewrite $m$ as $m = Nt$, and represent the learning
process using continuous time $t$ in the thermodynamic limit of $N \to \infty$. We then
obtain the deterministic differential equation

$$\frac{dl_k^2}{dt} = (\eta_k^2 - 2\eta_k)\, l_k^2 + \eta_k^2\, l_{k'}^2 - 2(\eta_k^2 - \eta_k)\, Q. \tag{10}$$

Here, $k$ is 1 or 2, and $k \neq k'$. In this equation, $Q = q\, l_k l_{k'}$ and $q$ is the
overlap between $\boldsymbol{J}^k$ and $\boldsymbol{J}^{k'}$, defined as

$$q = \frac{\boldsymbol{J}^k \cdot \boldsymbol{J}^{k'}}{|\boldsymbol{J}^k|\,|\boldsymbol{J}^{k'}|} = \frac{\boldsymbol{J}^k \cdot \boldsymbol{J}^{k'}}{N l_k l_{k'}}, \tag{11}$$

and $q$ is the second order parameter of the system. The overlap $q$ also has a
self-averaging property, so we can derive the differential equation in the
thermodynamic limit of $N \to \infty$. The differential equation is derived by
calculating the product of the learning equation (Eq. (9)) for $\boldsymbol{J}^k$ and that for
$\boldsymbol{J}^{k'}$, and then averaging the terms of the equation using the distribution
$P(u_k, u_{k'})$. We obtain the deterministic differential equation

$$\frac{dQ}{dt} = (\eta_2 - \eta_1\eta_2)\, l_1^2 + (\eta_1 - \eta_1\eta_2)\, l_2^2 - (\eta_1 + \eta_2 - 2\eta_1\eta_2)\, Q. \tag{12}$$

Equations (10) and (12) form closed differential equations.
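
Because Eqs. (10) and (12) are closed, they can be integrated numerically without simulating the networks. The sketch below uses forward Euler, which is an illustrative choice of integrator, and the initial conditions $l_1(0) = l_2(0) = 1$, $q(0) = -0.2$ used in Section 4; the function names and the step sizes $\eta_1 = 0.1$, $\eta_2 = 0.2$ are assumptions for the example.

```python
def order_parameter_rhs(l1sq, l2sq, Q, eta1, eta2):
    """Right-hand sides of Eq. (10) for k = 1, 2 and of Eq. (12)."""
    dl1sq = (eta1**2 - 2*eta1)*l1sq + eta1**2*l2sq - 2*(eta1**2 - eta1)*Q
    dl2sq = (eta2**2 - 2*eta2)*l2sq + eta2**2*l1sq - 2*(eta2**2 - eta2)*Q
    dQ = ((eta2 - eta1*eta2)*l1sq + (eta1 - eta1*eta2)*l2sq
          - (eta1 + eta2 - 2*eta1*eta2)*Q)
    return dl1sq, dl2sq, dQ

def integrate(l1sq, l2sq, Q, eta1, eta2, t_end, dt=0.01):
    """Forward-Euler integration of the order parameters up to time t_end."""
    for _ in range(int(round(t_end / dt))):
        d1, d2, dQ = order_parameter_rhs(l1sq, l2sq, Q, eta1, eta2)
        l1sq, l2sq, Q = l1sq + dt*d1, l2sq + dt*d2, Q + dt*dQ
    return l1sq, l2sq, Q

# l_1(0) = l_2(0) = 1 and Q(0) = q(0) l_1(0) l_2(0) = -0.2
l1sq_end, l2sq_end, Q_end = integrate(1.0, 1.0, -0.2, eta1=0.1, eta2=0.2, t_end=60.0)
print(l1sq_end, l2sq_end, Q_end)
```

At long times the three quantities coincide, which reflects the two students converging to the same weight vector.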

The analytical solutions of the length of the student $l_k$ and the overlap
between students $Q$ are given by

$$l_k^2(t) = -A_1 \frac{\eta_k}{\eta_{k'}} \exp\bigl(-(\eta_1+\eta_2)(2-(\eta_1+\eta_2))\,t\bigr) + (-1)^k\, 2A_2 \frac{\eta_k}{\eta_2-\eta_1} \exp\bigl(-(\eta_1+\eta_2)\,t\bigr) + A_3, \tag{13}$$

$$Q(t) = A_1 \exp\bigl(-(\eta_1+\eta_2)(2-(\eta_1+\eta_2))\,t\bigr) + A_2 \exp\bigl(-(\eta_1+\eta_2)\,t\bigr) + A_3, \tag{14}$$

where

$$A_1 = -\frac{\eta_1\eta_2\,\bigl(l_1^2(0) + l_2^2(0) - 2Q(0)\bigr)}{(\eta_1+\eta_2)^2}, \tag{15}$$

$$A_2 = \frac{(\eta_1-\eta_2)\bigl(\eta_2 l_1^2(0) - \eta_1 l_2^2(0) - (\eta_2-\eta_1)Q(0)\bigr)}{(\eta_1+\eta_2)^2}, \tag{16}$$

$$A_3 = \frac{\eta_2^2 l_1^2(0) + \eta_1^2 l_2^2(0) + 2\eta_1\eta_2 Q(0)}{(\eta_1+\eta_2)^2}. \tag{17}$$

$l_1(0)$ is the initial condition of student 1, and $l_2(0)$ is that of student 2.
$Q(0) = q(0)\, l_1(0)\, l_2(0)$, and $q(0)$ is the initial overlap between student 1 and
student 2. From Eqs. (13) and (14), $l_k^2(t)$ and $Q(t)$ converge to finite values as
$t \to \infty$ if $2 - (\eta_1 + \eta_2) > 0$ is satisfied. The convergence condition of
$l_k^2(t)$ and $Q(t)$ is therefore

$$\eta_1 + \eta_2 < 2. \tag{18}$$
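
The closed-form solution can be transcribed directly, which makes it easy to check that it reproduces the initial conditions and the long-time limit $A_3$. This is a sketch under stated assumptions: the function name is hypothetical, and the signs of $A_1$ and $A_2$ are fixed here by requiring $l_k^2(0)$ and $Q(0)$ to be recovered at $t = 0$ and Eqs. (10) and (12) to be satisfied.

```python
import math

def closed_form_order_parameters(t, l1sq0, l2sq0, Q0, eta1, eta2):
    """Evaluate Eqs. (13) and (14) at time t.

    This direct transcription needs eta1 != eta2 because of the
    eta_k / (eta2 - eta1) factor in Eq. (13).
    """
    if eta1 == eta2:
        raise ValueError("this direct transcription needs eta1 != eta2")
    s = eta1 + eta2
    A1 = -eta1*eta2*(l1sq0 + l2sq0 - 2*Q0) / s**2
    A2 = (eta1 - eta2)*(eta2*l1sq0 - eta1*l2sq0 - (eta2 - eta1)*Q0) / s**2
    A3 = (eta2**2*l1sq0 + eta1**2*l2sq0 + 2*eta1*eta2*Q0) / s**2
    decay_gap = math.exp(-s*(2 - s)*t)     # first exponential in Eqs. (13), (14)
    decay_mix = math.exp(-s*t)             # second exponential in Eqs. (13), (14)
    l1sq = -A1*(eta1/eta2)*decay_gap - 2*A2*eta1/(eta2 - eta1)*decay_mix + A3  # k = 1
    l2sq = -A1*(eta2/eta1)*decay_gap + 2*A2*eta2/(eta2 - eta1)*decay_mix + A3  # k = 2
    Q = A1*decay_gap + A2*decay_mix + A3
    return l1sq, l2sq, Q

start = closed_form_order_parameters(0.0, 1.0, 1.0, -0.2, eta1=0.1, eta2=0.2)
limit = closed_form_order_parameters(200.0, 1.0, 1.0, -0.2, eta1=0.1, eta2=0.2)
print(start, limit)
```

When $\eta_1 + \eta_2 \geq 2$ the first exponential no longer decays, which is the content of the convergence condition in Eq. (18).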

To depict the behavior of mutual learning with a latent teacher, we
have to obtain the differential equation of the overlap $R_k$, which is the direction
cosine between the latent teacher weight vector $\boldsymbol{B}$ and the $k$-th student weight
vector $\boldsymbol{J}^k$, defined by Eq. (19). We introduce $R_k$ as the third order parameter
of the system.

$$R_k = \frac{\boldsymbol{B} \cdot \boldsymbol{J}^k}{|\boldsymbol{B}|\,|\boldsymbol{J}^k|} = \frac{\boldsymbol{B} \cdot \boldsymbol{J}^k}{N l_k}. \tag{19}$$

For the sake of convenience, we write the overlap between the latent teacher
weight vector and the student weight vector as $r_k$, with $r_k = R_k l_k$. The
differential equation of the overlap $r_k$ is derived by calculating the product of
$\boldsymbol{B}$ and Eq. (9), and then averaging the terms of the equation using the
distribution $P(v, u_k, u_{k'})$. The overlap $r_k$ also has a self-averaging property,
and in the thermodynamic limit the deterministic differential equation of $r_k$ is
obtained through a calculation similar to that used for $l_k$:

$$\frac{dr_k}{dt} = \eta_k (r_{k'} - r_k). \tag{20}$$

The solution for the overlap $r_k$ is obtained by solving the simultaneous
differential equations of Eq. (20) for $k = 1$, $k' = 2$ and for $k = 2$, $k' = 1$:

$$r_k(t) = \frac{\eta_k \bigl(r_k(0) - r_{k'}(0)\bigr)}{\eta_1 + \eta_2} \exp\bigl(-(\eta_1 + \eta_2)\,t\bigr) + \frac{\eta_2 r_1(0) + \eta_1 r_2(0)}{\eta_1 + \eta_2}, \tag{21}$$

where $r_k(0) = R_k(0)\, l_k(0)$, and $R_k(0)$ is the initial overlap between the latent
teacher and the $k$-th student.
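
Equation (21) is simple enough to evaluate directly. The following minimal sketch does so for both students; the function name and the step sizes are illustrative assumptions, while the initial overlaps $r_1(0) = 0.6$ and $r_2(0) = 0.2$ are those used in Section 4.

```python
import math

def teacher_overlaps(t, r1_0, r2_0, eta1, eta2):
    """Eq. (21) for k = 1 and k = 2: overlaps r_k(t) with the latent teacher."""
    s = eta1 + eta2
    r_inf = (eta2 * r1_0 + eta1 * r2_0) / s        # common limit of r_1 and r_2
    decay = math.exp(-s * t)
    r1 = eta1 * (r1_0 - r2_0) / s * decay + r_inf
    r2 = eta2 * (r2_0 - r1_0) / s * decay + r_inf
    return r1, r2

for t in (0.0, 5.0, 50.0):
    print(t, teacher_overlaps(t, r1_0=0.6, r2_0=0.2, eta1=0.1, eta2=0.2))
```

The common limit is a weighted mean of the initial overlaps in which student 1's overlap is weighted by $\eta_2$ and student 2's by $\eta_1$, so the student with the smaller step size dominates the final state.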

The squared error for the $k$-th student $\epsilon_k$ is then defined using the output
of the latent teacher and that of the student as given in Eqs. (2) and (6),
respectively.

$$\epsilon_k = \frac{1}{2} \left( \boldsymbol{B} \cdot \boldsymbol{x} - \boldsymbol{J}^k \cdot \boldsymbol{x} \right)^2. \tag{22}$$

The generalization error for the $k$-th student $\epsilon_g$ is given by the squared
error $\epsilon_k$ in Eq. (22) averaged over the possible inputs $\boldsymbol{x}$ drawn from a
Gaussian distribution $P(\boldsymbol{x})$ of zero mean and variance $1/N$.

$$\epsilon_g = \int d\boldsymbol{x}\, P(\boldsymbol{x})\, \epsilon_k \tag{23}$$

$$= \frac{1}{2} \int d\boldsymbol{x}\, P(\boldsymbol{x}) \left( \boldsymbol{B} \cdot \boldsymbol{x} - \boldsymbol{J}^k \cdot \boldsymbol{x} \right)^2. \tag{24}$$

This calculation is an $N$-dimensional Gaussian integral over $\boldsymbol{x}$ and is hard to
carry out directly. To overcome this difficulty, we employ a coordinate
transformation from $\boldsymbol{x}$ to $v$ and $u_k$ in Eqs. (2) and (6). Note that the
distribution of the output of the students $P(u_k)$ follows a Gaussian distribution
of zero mean and variance $l_k^2$ in the thermodynamic limit of $N \to \infty$. For the
same reason, the output distribution for the latent teacher $P(v)$ follows a Gaussian
distribution of zero mean and unit variance in the thermodynamic limit. Thus, the
distribution $P(v, u_k)$ of the latent teacher output $v$ and the $k$-th student output
$u_k$ is

$$P(v, u_k) = \frac{1}{2\pi \sqrt{|\Sigma|}} \exp\left( -\frac{(v, u_k)\, \Sigma^{-1}\, (v, u_k)^T}{2} \right), \tag{25}$$

$$\Sigma = \begin{pmatrix} 1 & r_k \\ r_k & l_k^2 \end{pmatrix}. \tag{26}$$

Here, $T$ denotes the transpose of a vector, $r_k$ denotes $R_k l_k$, and $R_k$ is
the overlap between the latent teacher weight vector $\boldsymbol{B}$ and the student weight
vector $\boldsymbol{J}^k$ defined by Eq. (19). Hence, by using this coordinate transformation,
the generalization error in Eq. (24) can be rewritten as

$$\epsilon_g = \frac{1}{2} \int dv\, du_k\, P(v, u_k)\, (v - u_k)^2 \tag{27}$$

$$= \frac{1}{2} \left( 1 - 2 r_k + l_k^2 \right). \tag{28}$$
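
Equation (28) can be checked against a direct sample average of the squared error in Eq. (22). The sketch below is only an illustration: the student vector is an arbitrary construction, the inputs are assumed Gaussian, and the teacher is rescaled so that $|\boldsymbol{B}| = \sqrt{N}$ holds exactly at finite $N$.

```python
import numpy as np

def generalization_error(r, l_sq):
    """Eq. (28): error from the order parameters r_k = R_k l_k and l_k^2."""
    return 0.5 * (1.0 - 2.0 * r + l_sq)

rng = np.random.default_rng(0)
n_dim, n_samples = 200, 20000
B = rng.normal(size=n_dim)
B *= np.sqrt(n_dim) / np.linalg.norm(B)        # enforce |B| = sqrt(N) exactly
J = 0.5 * B + rng.normal(size=n_dim)           # an arbitrary student (assumption)

r = B @ J / n_dim                              # r_k = B . J / N
l_sq = J @ J / n_dim                           # l_k^2 = J . J / N
X = rng.normal(0.0, np.sqrt(1.0 / n_dim), size=(n_samples, n_dim))
empirical = float(np.mean(0.5 * (X @ B - X @ J) ** 2))   # sample mean of Eq. (22)
predicted = float(generalization_error(r, l_sq))
print(empirical, predicted)
```

The two numbers agree up to sampling noise, since the average of Eq. (22) depends on the weight vectors only through $r_k$ and $l_k^2$.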

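Consequently, we calculate the dynamics of the generalization error by substituting
the time-dependent values of $l_k(t)$, $Q(t)$, and $r_k(t)$ into Eq. (28).

$$\begin{aligned}
\epsilon_g(t) = \frac{1}{2} \Bigl\{ \, & 1 - \frac{2\eta_k \bigl(r_k(0) - r_{k'}(0)\bigr)}{\eta_1 + \eta_2} \exp\bigl(-(\eta_1 + \eta_2)\,t\bigr) - \frac{2\bigl(\eta_2 r_1(0) + \eta_1 r_2(0)\bigr)}{\eta_1 + \eta_2} \\
& + \frac{\eta_k^2 \bigl(l_1^2(0) + l_2^2(0) - 2Q(0)\bigr)}{(\eta_1 + \eta_2)^2} \exp\bigl(-(\eta_1 + \eta_2)(2 - (\eta_1 + \eta_2))\,t\bigr) \\
& - (-1)^k \frac{2\eta_k \bigl(\eta_2 l_1^2(0) - \eta_1 l_2^2(0) - (\eta_2 - \eta_1) Q(0)\bigr)}{(\eta_1 + \eta_2)^2} \exp\bigl(-(\eta_1 + \eta_2)\,t\bigr) \\
& + \frac{\eta_2^2 l_1^2(0) + \eta_1^2 l_2^2(0) + 2\eta_1\eta_2 Q(0)}{(\eta_1 + \eta_2)^2} \Bigr\}.
\end{aligned} \tag{29}$$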

4 Results



When the step sizes of two students are the same, the mutual learning asymp- 
totically converges to the average weight vector of two students [12]. In this 
section, we analyze the asymptotic property of mutual learning in the case of 
different step sizes, and then discuss the relationship between mutual learning 
and ensemble learning. 

4.1 Effect of step size on the asymptotic property of mu- 
tual learning 

We analyze the effect of the learning step size on the asymptotic property of mu- 
tual learning. Two students use different learning step sizes. For this purpose, 
we use computer simulations. 

Figure 2 shows trajectories of the student weight vectors when the initial
overlaps between the latent teacher and the students were inhomogeneous: (a)
shows the results obtained by fixing the learning step size of student 1
($\eta_1$) at 0.1 and setting the learning step size of student 2 ($\eta_2$) to 0.1, 0.2,
0.3, or 0.5; (b) shows the results obtained by fixing $\eta_1$ at 0.01 and setting
$\eta_2$ to 0.01, 0.02, 0.03, or 0.05. In these figures, the horizontal axis shows the
length of the student weight vector $l_k$, and the vertical axis shows the overlap
$R_k$. The initial conditions were $l_1(0) = l_2(0) = 1$, $R_1(0) = 0.6$, $R_2(0) = 0.2$,
and $q(0) = -0.2$. The theoretical results obtained using Eqs. (13), (14), and (21)
are shown as thick lines, and the results obtained through computer simulations for
$N = 10000$ are shown as thin lines. The upper lines show trajectories of the weight
vector of student 1, and the lower lines show those of student 2. The black
rectangles mark the convergence points of the trajectories of the student weight
vectors. The numbers above the symbols show the learning step sizes of student 2.

When the learning step sizes $\eta_1$ and $\eta_2$ were the same, student 1 started at
$l_1(0) = 1$ and $R_1(0) = 0.6$, and converged to the average weight vector of the
initial student vectors, denoted by AW. Student 2 started at $l_2(0) = 1$ and
$R_2(0) = 0.2$, and also converged to the average weight vector AW.

When the learning step sizes $\eta_1$ and $\eta_2$ were not the same, the convergence
points changed with the step size $\eta_2$ of 0.2, 0.3, or 0.5, as shown in Fig. 2(a).
Likewise, Fig. 2(b) shows that the convergence points changed with the step size
$\eta_2$ of 0.02, 0.03, or 0.05. Note that the convergence points for the same ratio
of the learning step sizes tend to be the same. Thus, we pay attention to the effect
of the ratio of learning step sizes $\eta_2/\eta_1$ in the mutual learning.

Figure 2: Trajectories of student weight vector for the inhomogeneous case. The
initial conditions were $l_1(0) = l_2(0) = 1$, $R_1(0) = 0.6$, $R_2(0) = 0.2$, and
$q(0) = -0.2$. (a) Results of setting the learning step size to $\eta_1 = 0.1$ (fixed)
and $\eta_2 = 0.1$, 0.2, 0.3, or 0.5. (b) Results of setting the learning step size to
$\eta_1 = 0.01$ (fixed) and $\eta_2 = 0.01$, 0.02, 0.03, or 0.05.

Figure 3 shows the learning step size dependence of the generalization error.
The learning step size of student 1 was fixed at 0.1 or 0.01, and that of student
2 was changed as shown in the figure. The horizontal axis shows the ratio of
learning step sizes $\eta_2/\eta_1$, and the vertical axis shows the asymptotic
generalization error $\epsilon_g$. The asymptotic generalization error is obtained
using Eq. (29) for the case of $t \to \infty$. The results show that the asymptotic
generalization error was minimized when the ratio $\eta_2/\eta_1$ was 2. Consequently,
the asymptotic generalization error can be minimized by using the optimal ratio of
learning step sizes. Next, we obtain this optimal ratio of learning step sizes that
minimizes the asymptotic property of the generalization error.
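
The dependence on the ratio $\eta_2/\eta_1$ described above can be reproduced by scanning the $t \to \infty$ limit of Eq. (29), in which the exponential terms vanish. The following is a minimal sketch; the function names and the scan grid are illustrative assumptions, and the initial conditions are those quoted for Fig. 2.

```python
def asymptotic_error(eta1, eta2, l1sq0, l2sq0, Q0, r1_0, r2_0):
    """t -> infinity limit of Eq. (29): only the constant terms survive."""
    s = eta1 + eta2
    r_inf = (eta2*r1_0 + eta1*r2_0) / s
    lsq_inf = (eta2**2*l1sq0 + eta1**2*l2sq0 + 2*eta1*eta2*Q0) / s**2
    return 0.5 * (1.0 - 2.0*r_inf + lsq_inf)

def best_ratio(eta1, ratios, **init):
    """Return the ratio eta2/eta1 on the grid with the smallest asymptotic error."""
    return min(ratios, key=lambda a: asymptotic_error(eta1, a*eta1, **init))

# l_1(0) = l_2(0) = 1, q(0) = -0.2, R_1(0) = 0.6, R_2(0) = 0.2
init = dict(l1sq0=1.0, l2sq0=1.0, Q0=-0.2, r1_0=0.6, r2_0=0.2)
ratios = [0.1 * i for i in range(1, 101)]          # 0.1, 0.2, ..., 10.0
print(best_ratio(0.1, ratios, **init), best_ratio(0.01, ratios, **init))
```

The grid minimum is the same for both values of $\eta_1$ because the asymptotic error depends on the step sizes only through their ratio.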

4.2 Optimization of the asymptotic property of the gen- 
eralization error 

We now analyze the asymptotic property of the generalization error based on
the ratio of learning step sizes, and then we obtain the optimum ratio of learning
step sizes $\eta_2/\eta_1$ that minimizes the asymptotic property of the generalization
error.

The asymptotic values of the order parameters are obtained by taking
$t \to \infty$ in Eqs. (13), (14), and (21):

$$l_1^2(\infty) = l_2^2(\infty) = Q(\infty) = \frac{\eta_2^2 l_1^2(0) + \eta_1^2 l_2^2(0) + 2\eta_1\eta_2 Q(0)}{(\eta_1 + \eta_2)^2}, \tag{30}$$

$$r_1(\infty) = r_2(\infty) = \frac{\eta_2}{\eta_1 + \eta_2}\, r_1(0) + \frac{\eta_1}{\eta_1 + \eta_2}\, r_2(0). \tag{31}$$

Figure 3: Relation between learning step size and generalization error. The
learning step size of student 1 was fixed at 0.1 or 0.01, and that of student 2 was
changed. The generalization error is minimized when the ratio of the learning
step sizes is two in both cases. The optimum ratio is independent of the magnitude
of the learning step size.

The above equations show that the mutual learning converges to the internal
dividing point of the initial student weight vectors. Using Eqs. (30) and (31),
we can obtain the asymptotic property of the generalization error:

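$$\epsilon_g(\infty) = \frac{1}{2} \left\{ 1 - 2\, \frac{\eta_2 r_1(0) + \eta_1 r_2(0)}{\eta_1 + \eta_2} + \frac{\eta_2^2 l_1^2(0) + \eta_1^2 l_2^2(0) + 2\eta_1\eta_2 Q(0)}{(\eta_1 + \eta_2)^2} \right\}. \tag{32}$$

We rewrite the generalization error by replacing the ratio $\eta_2/\eta_1$ with $\alpha$:

$$\epsilon_g(\infty) = \frac{1}{2} \left\{ 1 - 2\, \frac{\alpha r_1(0) + r_2(0)}{\alpha + 1} + \frac{\alpha^2 l_1^2(0) + l_2^2(0) + 2\alpha Q(0)}{(\alpha + 1)^2} \right\}. \tag{33}$$

When the generalization error is minimized, $\partial \epsilon_g(\infty) / \partial \alpha = 0$ is
satisfied, so

$$\frac{\partial \epsilon_g(\infty)}{\partial \alpha} = \frac{1}{2} \left\{ \frac{2\alpha l_1^2(0) + 2Q(0)}{(\alpha + 1)^2} - \frac{2\bigl(\alpha^2 l_1^2(0) + l_2^2(0) + 2\alpha Q(0)\bigr)}{(\alpha + 1)^3} + \frac{2\bigl(\alpha r_1(0) + r_2(0)\bigr)}{(\alpha + 1)^2} - \frac{2 r_1(0)}{\alpha + 1} \right\} = 0. \tag{34}$$

Solving Eq. (34), we obtain $\alpha_{\mathrm{opt}}$ as

$$\alpha_{\mathrm{opt}} = \frac{l_2^2(0) - Q(0) + r_1(0) - r_2(0)}{l_1^2(0) - Q(0) - r_1(0) + r_2(0)}. \tag{35}$$

Therefore, the optimum ratio of the learning step sizes is given by Eq. (35). The
optimum asymptotic property of the generalization error is obtained by substituting
Eq. (35) into Eq. (33):

$$\epsilon_g^{\mathrm{opt}}(\infty) = \frac{1}{2} \left\{ 1 - 2\bigl(\kappa r_1(0) + (1 - \kappa) r_2(0)\bigr) + \kappa^2 l_1^2(0) + (1 - \kappa)^2 l_2^2(0) + 2\kappa(1 - \kappa) Q(0) \right\}. \tag{36}$$

Here, $\kappa$ is defined as $\kappa = \alpha_{\mathrm{opt}} / (1 + \alpha_{\mathrm{opt}})$.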
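
As a worked instance of Eqs. (35) and (36), the sketch below evaluates the optimum ratio and the corresponding asymptotic error for the initial conditions used in Section 4.1 ($l_1(0) = l_2(0) = 1$, $R_1(0) = 0.6$, $R_2(0) = 0.2$, $q(0) = -0.2$). The function names are illustrative assumptions.

```python
def optimal_ratio(l1sq0, l2sq0, Q0, r1_0, r2_0):
    """Eq. (35): optimum ratio alpha_opt = eta2 / eta1."""
    return (l2sq0 - Q0 + r1_0 - r2_0) / (l1sq0 - Q0 - r1_0 + r2_0)

def optimal_asymptotic_error(l1sq0, l2sq0, Q0, r1_0, r2_0):
    """Eq. (36) with kappa = alpha_opt / (1 + alpha_opt)."""
    alpha = optimal_ratio(l1sq0, l2sq0, Q0, r1_0, r2_0)
    kappa = alpha / (1.0 + alpha)
    return 0.5 * (1.0 - 2.0*(kappa*r1_0 + (1.0 - kappa)*r2_0)
                  + kappa**2*l1sq0 + (1.0 - kappa)**2*l2sq0
                  + 2.0*kappa*(1.0 - kappa)*Q0)

# r_k(0) = R_k(0) l_k(0) and Q(0) = q(0) l_1(0) l_2(0)
alpha_opt = optimal_ratio(1.0, 1.0, -0.2, 0.6, 0.2)
eps_opt = optimal_asymptotic_error(1.0, 1.0, -0.2, 0.6, 0.2)
print(alpha_opt, eps_opt)
```

For these initial conditions the formula gives $\alpha_{\mathrm{opt}} = 2$, in agreement with the minimum reported for Fig. 3.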

On the other hand, we can consider the linear combination of the initial
weight vectors of the students, that is, $\boldsymbol{J} = C \boldsymbol{J}^1(0) + (1 - C) \boldsymbol{J}^2(0)$, and
minimize the generalization error with respect to $C$. This is ensemble learning
with two students, so from the appendix, the optimum $C^*$ that minimizes the
generalization error is

$$C^* = \frac{l_2^2(0) - Q(0) + r_1(0) - r_2(0)}{l_1^2(0) + l_2^2(0) - 2Q(0)}. \tag{37}$$

Therefore, the optimum ratio $C^*/(1 - C^*)$ is obtained as

$$\frac{C^*}{1 - C^*} = \frac{l_2^2(0) - Q(0) + r_1(0) - r_2(0)}{l_1^2(0) - Q(0) - r_1(0) + r_2(0)} = \alpha_{\mathrm{opt}}, \tag{38}$$

and it is shown that the optimum ratio of the learning step sizes of mutual
learning, $\alpha_{\mathrm{opt}} = \eta_2^{\mathrm{opt}} / \eta_1^{\mathrm{opt}}$, is equal to that of the optimum
linear combination of the initial weight vectors, $C^*/(1 - C^*)$. Consequently,
mutual learning using an optimum ratio of learning step sizes converges to the
optimum ensemble learning that is the linear combination of the initial student
vectors.
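
The identity in Eq. (38) can be checked numerically for a few sets of initial order parameters. The sketch below is illustrative only: the function names and the second and third test cases are assumptions, while the first case uses the initial conditions of Section 4.1.

```python
import math

def optimal_ratio(l1sq0, l2sq0, Q0, r1_0, r2_0):
    """Eq. (35): optimum step-size ratio alpha_opt = eta2 / eta1."""
    return (l2sq0 - Q0 + r1_0 - r2_0) / (l1sq0 - Q0 - r1_0 + r2_0)

def optimal_weight(l1sq0, l2sq0, Q0, r1_0, r2_0):
    """Eq. (37): optimum weight C* of student 1 in C J1(0) + (1 - C) J2(0)."""
    return (l2sq0 - Q0 + r1_0 - r2_0) / (l1sq0 + l2sq0 - 2.0*Q0)

cases = [
    (1.0, 1.0, -0.2, 0.6, 0.2),    # initial conditions of Section 4.1
    (1.0, 1.44, 0.3, 0.5, 0.4),    # arbitrary example values (assumption)
    (0.81, 1.0, 0.1, 0.3, 0.5),    # arbitrary example values (assumption)
]
identity_holds = []
for case in cases:
    C = optimal_weight(*case)
    identity_holds.append(math.isclose(C / (1.0 - C), optimal_ratio(*case)))
print(identity_holds)
```

Since $C^*$ multiplies student 1 while $\alpha_{\mathrm{opt}} = \eta_2/\eta_1$ has student 2's step size in the numerator, a student that deserves a large ensemble weight should move with a small step size.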



5 Conclusion 

We have proposed an optimization method for mutual learning by means of mini- 
mizing the asymptotic property of the generalization error within the framework 
of on-line learning. We first formulated mutual learning with a latent teacher, 
and then derived the differential equations of order parameters that depict the 
learning process. The order parameters of mutual learning are the length of 
the student weight vector $l_k$ and the overlap between students $q$. To depict
the relationship between the latent teacher and the students, we introduced the
order parameter $R_k$. We derived these differential equations using statistical
mechanics methods and solved them analytically. After that, we obtained the 
dynamics of the generalization error using these order parameters. 

Next, we used the theoretical results to analyze the relationship between
the asymptotic property of the mutual learning and the learning step sizes of the
students. We found that the asymptotic property of the mutual learning depends
on the ratio of the learning step sizes of the two students, and not on the
learning step sizes themselves. We analytically obtained the optimum ratio of
the learning step sizes that minimizes the generalization error. We also showed
that the optimum ratio of the learning step sizes $\eta_2/\eta_1$ is equal to the
ratio $C^*/(1 - C^*)$ of the optimum weights in the linear combination of the
initial student weight vectors; that is, the step sizes are inversely proportional
to the optimum averaging weights of the corresponding students. We conclude that the
integration mechanism of ensemble learning can be mimicked through mutual 
learning by introducing the interaction between students. Our future work will 
include analysis of the mutual learning with non-linear perceptrons. 

A Ensemble learning

Ensemble learning is a learning method using many weak learning machines 
to improve upon the performance of a single weak learning machine [TJ [51 [5]. 
Students learn from the teacher individually, and then an ensemble output is 
calculated by integrating the students' outputs. Because many students are 
used, ensemble learning is effective when the students differ from each other. 
Therefore, we assume that the overlap (direction cosine) between the $k$-th student
and the $k'$-th student, $q_{kk'}$, is not one. The ensemble output of the student
networks $u$ is given by the weighted average of each student output using the
weights for averaging $C_k$:

$$u = \sum_{k=1}^{K} C_k u_k = \sum_{k=1}^{K} C_k \left( \boldsymbol{J}^k \cdot \boldsymbol{x} \right). \tag{39}$$

Here, the number of students is $K$ and we assume $\sum_{k=1}^{K} C_k = 1$. In the
following, we assume that the number of students is two. We use linear perceptrons
as the students, so the weighted average output of the two students is equal to the
output of a perceptron having the weighted average of the two student weight
vectors. The weighted average of the two student weight vectors $\boldsymbol{J}^E$ is defined
as follows [12].

$$\boldsymbol{J}^E = C_k \boldsymbol{J}^k + C_{k'} \boldsymbol{J}^{k'} = C \boldsymbol{J}^k + (1 - C) \boldsymbol{J}^{k'}. \tag{40}$$

Here, we rewrite $C_k$ as $C$ and $C_{k'}$ as $1 - C$ using $C_k + C_{k'} = 1$. From this
equation, ensemble learning can be viewed as the linear combination of the two
student weight vectors. Note that ensemble learning is a static process, so there
is no dynamical property. The length of the weight vector $l_E$ and the overlap
$r_E$ are given by

$$(l_E)^2 = C^2 l_k^2 + (1 - C)^2 l_{k'}^2 + 2C(1 - C) Q, \tag{41}$$

$$r_E = C r_k + (1 - C) r_{k'}. \tag{42}$$

The generalization error of the ensemble output $\epsilon_g^E$ is given by substituting
Eqs. (41) and (42) into Eq. (28):

$$\epsilon_g^E = \frac{1}{2} \left\{ 1 - 2\bigl(C r_k + (1 - C) r_{k'}\bigr) + C^2 l_k^2 + (1 - C)^2 l_{k'}^2 + 2C(1 - C) Q \right\}. \tag{43}$$

The optimum weight for averaging $C^*$ satisfies the condition
$\partial \epsilon_g^E / \partial C = 0$, from which we obtain

$$C^* = \frac{l_{k'}^2 - Q + r_k - r_{k'}}{l_k^2 + l_{k'}^2 - 2Q}. \tag{44}$$

When the student weight vector lengths satisfy $l_k = l_{k'} = l$ and the overlaps
with the latent teacher satisfy $r_k = r_{k'} = r$, Eq. (44) gives
$C^* = 1 - C^* = 1/2$, and the simple average of the two students is the optimum
ensemble output.
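
The minimization behind Eq. (44) can be cross-checked by evaluating Eq. (43) on a grid of weights. The sketch below is a minimal illustration; the function names, the grid resolution, and the numerical values (taken from the initial conditions of Section 4.1, plus a symmetric variant) are assumptions made for the example.

```python
def ensemble_error(C, lk_sq, lkp_sq, Q, rk, rkp):
    """Eq. (43): generalization error of the weighted average C J^k + (1 - C) J^k'."""
    r_E = C*rk + (1.0 - C)*rkp                                     # Eq. (42)
    lE_sq = C**2*lk_sq + (1.0 - C)**2*lkp_sq + 2.0*C*(1.0 - C)*Q   # Eq. (41)
    return 0.5 * (1.0 - 2.0*r_E + lE_sq)

def optimal_weight(lk_sq, lkp_sq, Q, rk, rkp):
    """Eq. (44): weight C* that minimizes Eq. (43)."""
    return (lkp_sq - Q + rk - rkp) / (lk_sq + lkp_sq - 2.0*Q)

grid = [i / 1000 for i in range(1001)]                 # C = 0.000, 0.001, ..., 1.000
C_grid = min(grid, key=lambda C: ensemble_error(C, 1.0, 1.0, -0.2, 0.6, 0.2))
C_star = optimal_weight(1.0, 1.0, -0.2, 0.6, 0.2)      # unequal teacher overlaps
C_sym = optimal_weight(1.0, 1.0, -0.2, 0.4, 0.4)       # symmetric case r_k = r_k'
print(C_grid, C_star, C_sym)
```

The grid minimum lands on the grid point nearest to the closed-form $C^*$, and the symmetric case recovers the simple average.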